Some NHANES National Youth Fitness Survey (2012) data
Exploration and Initial Data Summaries
Dealing with Missingness then Partitioning
How might we transform our outcome?
Building Candidate Prediction Models
Checking Regression Assumptions
Assessing the candidates, in training and test samples
Today’s Packages
library(janitor)      # clean_names(), tabyl(), etc.
library(naniar)       # identifying/dealing with NA
library(patchwork)    # combining ggplot2 plots
library(broom)        # for tidy(), glance(), augment()
library(car)          # for boxCox
library(corrr)        # for correlation matrix
library(GGally)       # for scatterplot matrix
library(ggrepel)      # help with residual plots
library(gt)           # formatting tables
library(gtsummary)    # for tbl_regression()
library(mosaic)       # favstats, df_stats & inspect
library(sessioninfo)  # for session_info()
library(simputation)  # for single imputation
library(tidyverse)

theme_set(theme_bw())
options(tidyverse.quiet = TRUE)
options(dplyr.summarise.inform = FALSE)
I don’t want to do single imputation on the outcome, and it seems we can still assume MCAR, so we’ll filter to complete cases on our outcome, and then impute the one remaining missing waist value.
nnyfs <- nnyfs1 |>
  filter(complete.cases(triceps)) |>   # don't want to impute outcome
  impute_rlm(waist ~ age + triceps + health + asthma)

miss_case_table(nnyfs)
mod_0 <- lm(triceps ~ waist + age + asthma + health, data = nnyfs)
boxCox(mod_0)
Power Transformation suggestions
summary(powerTransform(mod_0))
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 0.3167 0.33 0.2252 0.4083
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 43.93977 1 3.3864e-11
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 230.6417 1 < 2.22e-16
I guess we could use the cube root transformation: \(\mbox{triceps}^{1/3}\) but I’ll go with \(\sqrt{\mbox{triceps}}\) as a starting point, because it’s somehow less daunting.
DTDP: \(\sqrt{\mbox{triceps}}\)
p1 <- ggplot(nnyfs_train, aes(x = sqrt(triceps))) +
  geom_histogram(bins = 20, fill = "slateblue", col = "white")

p2 <- ggplot(nnyfs_train, aes(sample = sqrt(triceps))) +
  geom_qq(col = "slateblue") +
  geom_qq_line(col = "violetred") +
  labs(y = "Observed sqrt(triceps)", x = "Normal (0,1) quantiles") +
  theme(aspect.ratio = 1)

p3 <- ggplot(nnyfs_train, aes(x = "", y = sqrt(triceps))) +
  geom_violin(fill = "slateblue", alpha = 0.1) +
  geom_boxplot(fill = "slateblue", width = 0.3, notch = TRUE,
               outlier.color = "slateblue", outlier.size = 3) +
  labs(x = "") +
  coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, heights = c(3, 2)) +
  plot_annotation(
    title = "Square Root of Triceps Skinfold (mm)",
    subtitle = str_glue("Model Development Sample: ",
                        nrow(nnyfs_train), " children in NNYFS 2012"))
DTDP: \(\sqrt{\mbox{triceps}}\)
DTDP: Is \(\mbox{triceps}^{1/3}\) much better?
p1 <- ggplot(nnyfs_train, aes(x = triceps^(1/3))) +
  geom_histogram(bins = 20, fill = "royalblue", col = "white")

p2 <- ggplot(nnyfs_train, aes(sample = triceps^(1/3))) +
  geom_qq(col = "royalblue") +
  geom_qq_line(col = "magenta") +
  labs(y = "Observed triceps^(1/3)", x = "Normal (0,1) quantiles") +
  theme(aspect.ratio = 1)

p3 <- ggplot(nnyfs_train, aes(x = "", y = triceps^(1/3))) +
  geom_violin(fill = "royalblue", alpha = 0.1) +
  geom_boxplot(fill = "royalblue", width = 0.3, notch = TRUE,
               outlier.color = "royalblue", outlier.size = 3) +
  labs(x = "") +
  coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, heights = c(3, 2)) +
  plot_annotation(
    title = "Cube Root of Triceps Skinfold (mm)",
    subtitle = str_glue("Model Development Sample: ",
                        nrow(nnyfs_train), " children in NNYFS 2012"))
DTDP: Is \(\mbox{triceps}^{1/3}\) much better?
Identify and Fit Candidate Models in training sample
Scatterplot Matrix
Again, I select the outcome (\(\sqrt{\mbox{triceps}}\)) last, so the bottom row shows the most important scatterplots.
temp <- nnyfs_train |>
  mutate(sqrt_tri = sqrt(triceps)) |>
  select(waist, age, asthma, health, sqrt_tri)

ggpairs(temp, title = "Scatterplots: Model Development Sample",
        lower = list(combo = wrap("facethist", bins = 20)))
Scatterplot Matrix
What about the categories?
What does our outcome look like by asthma category?
vif(lm(triceps ~ waist + age + asthma + health, data = nnyfs))
GVIF Df GVIF^(1/(2*Df))
waist 1.890109 1 1.374813
age 1.856713 1 1.362612
asthma 1.069957 1 1.034387
health 1.081987 2 1.019895
Any major concerns here? What are we looking for?
Use step to pick a candidate?
We’ll use backwards selection, starting with a model including all four predictors (the “kitchen sink” model).
step(lm(sqrt(triceps) ~ waist + age + asthma + health, data = nnyfs_train))
Use step to pick a candidate?
Start: AIC=-1483.18
sqrt(triceps) ~ waist + age + asthma + health
Df Sum of Sq RSS AIC
- asthma 1 0.02 224.23 -1485.10
<none> 224.21 -1483.18
- health 2 2.92 227.13 -1474.22
- age 1 41.72 265.93 -1314.52
- waist 1 386.76 610.97 -482.71
Step: AIC=-1485.1
sqrt(triceps) ~ waist + age + health
Df Sum of Sq RSS AIC
<none> 224.23 -1485.1
- health 2 2.94 227.17 -1476.1
- age 1 41.74 265.97 -1316.4
- waist 1 387.24 611.46 -483.9
Call:
lm(formula = sqrt(triceps) ~ waist + age + health, data = nnyfs_train)
Coefficients:
(Intercept) waist age healthVeryGood healthOther
0.46543 0.05709 -0.07535 0.12337 0.00586
An Important Point
Stepwise regression lands on our mod_1, as it turns out.
There is a huge amount of evidence that variable selection causes severe problems in estimation and inference.
Stepwise regression is an especially bad choice.
Disappointingly, there really isn’t a great choice. The task itself just isn’t one we can do well in a uniform way across all of the different types of regression models we’ll build.
More on this in 432.
Which models will we fit?
Model   waist   age     asthma  health
modA    Yes     No      No      No
modB    Yes     Yes     No      No
modC    Yes     Yes     No      Yes
modD    Yes     Yes     Yes     Yes
Stepwise backwards selection via AIC suggested modC.
“Kitchen Sink” is modD
Today, each model is a subset of the next model.
Fitting modA through modD
Using the training sample…
modA <- lm(sqrt(triceps) ~ waist, data = nnyfs_train)
modB <- lm(sqrt(triceps) ~ waist + age, data = nnyfs_train)
modC <- lm(sqrt(triceps) ~ waist + age + health, data = nnyfs_train)
modD <- lm(sqrt(triceps) ~ waist + age + asthma + health, data = nnyfs_train)
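Because each model is nested in the next, one optional check (a sketch, not part of the original workflow, and assuming modA through modD have been fit as above) is a sequential ANOVA comparison alongside side-by-side fit summaries from broom:

# Sequential F tests for the nested models modA through modD
anova(modA, modB, modC, modD)

# One-row fit summaries per model, stacked for comparison
bind_rows(glance(modA), glance(modB), glance(modC), glance(modD)) |>
  mutate(model = c("modA", "modB", "modC", "modD")) |>
  select(model, r.squared, adj.r.squared, sigma, AIC, BIC)

The anova() comparison only makes sense here because the predictor sets are strictly nested.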
Scatterplot with modA
ggplot(data = nnyfs_train, aes(x = waist, y = sqrt(triceps))) +
  geom_point(col = "slateblue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Waist Circumference (cm)",
       y = "Square Root of Triceps Skinfold (mm)",
       title = "modA and our data",
       subtitle = "nnyfs Training Sample (n = 1000)",
       caption = "R-squared = 0.623") +
  theme(aspect.ratio = 1)
Regression Assumptions & Residual Plots via ggplot2
Checking Regression Assumptions
Four key assumptions we need to think about:
Linearity
Constant Variance (Homoscedasticity)
Normality
Independence
How do we assess 1, 2, and 3? Residual plots.
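As a quick alternative before building the ggplot2 versions (a sketch, assuming modA has been fit in the training sample), base R's plot() method for lm objects produces four similar diagnostic panels:

# Base R diagnostic panels for modA: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(modA)
par(mfrow = c(1, 1))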
Augmenting our training data
aug_A <- augment(modA, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps))  # add in our model's outcome

aug_B <- augment(modB, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps))

aug_C <- augment(modC, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps))

aug_D <- augment(modD, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps))
Residuals vs. Fitted Values: ggplot2
ggplot(aug_A, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug_A |> slice_max(abs(.resid), n = 5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> slice_max(abs(.resid), n = 5),
                  aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted Values from modA",
       caption = "5 largest |residuals| highlighted in red.",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") +
  theme(aspect.ratio = 1)
Residuals vs. Fitted Values: ggplot2
Standardized Residuals: ggplot2
p1 <- ggplot(aug_A, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual from modA",
       x = "Standard Normal Quantiles") +
  theme(aspect.ratio = 1)

p2 <- ggplot(aug_A, aes(y = .std.resid, x = "")) +
  geom_violin(fill = "ivory") +
  geom_boxplot(width = 0.3) +
  labs(title = "Box and Violin Plots",
       y = "Standardized Residual from modA",
       x = "modA")

p1 + p2 +
  plot_layout(widths = c(2, 1)) +
  plot_annotation(
    title = "Normality of Standardized Residuals from modA",
    caption = str_glue("n = ", nrow(aug_A |> select(.std.resid)),
                       " residual values are plotted here."))
Standardized Residuals: ggplot2
Scale-Location Plot via ggplot2
ggplot(aug_A, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_point(data = aug_A |> slice_max(sqrt(abs(.std.resid)), n = 3),
             col = "red", size = 1) +
  geom_text_repel(data = aug_A |> slice_max(sqrt(abs(.std.resid)), n = 3),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot for modA",
       caption = "3 largest |Standardized Residual| in red.",
       x = "Fitted Value of sqrt(triceps)",
       y = "Square Root of |Standardized Residual|") +
  theme(aspect.ratio = 1)
ggplot(aug_A, aes(x = .hat, y = .std.resid)) +
  geom_point() +
  geom_point(data = aug_A |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> filter(.cooksd >= 0.5),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage from modA",
       caption = "Red points indicate Cook's d at least 0.5",
       x = "Leverage", y = "Standardized Residual") +
  theme(aspect.ratio = 1)
Residuals vs. Leverage Plot via ggplot2
Some Notes on the Residuals vs. Leverage Plot
In this ggplot() approach,
Points with Cook's d >= 0.5 would be highlighted in red, if there were any.
Points to the right of the dashed line have more than 3 times the average leverage, and so would be flagged as highly leveraged by some analysts.
ggplot2 Residual Plots: modA
p1 <- ggplot(aug_A, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug_A |> slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> slice_max(abs(.resid), n = 3),
                  aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual")

p2 <- ggplot(aug_A, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", x = "Standard Normal Quantiles")

p3 <- ggplot(aug_A, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", y = "|Std. Residual|^(1/2)")

p4 <- ggplot(aug_A, aes(x = .hat, y = .std.resid)) +
  geom_point() +
  geom_point(data = aug_A |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> filter(.cooksd >= 0.5),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Assessing Residuals for modA",
    caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
ggplot2 Residual Plots: modA
ggplot2 Residual Plots: modB
p1 <- ggplot(aug_B, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug_B |> slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_B |> slice_max(abs(.resid), n = 3),
                  aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual")

p2 <- ggplot(aug_B, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", x = "Standard Normal Quantiles")

p3 <- ggplot(aug_B, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", y = "|Std. Residual|^(1/2)")

p4 <- ggplot(aug_B, aes(x = .hat, y = .std.resid)) +
  geom_point() +
  geom_point(data = aug_B |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_B |> filter(.cooksd >= 0.5),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Assessing Residuals for modB",
    caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
ggplot2 Residual Plots: modB
ggplot2 Residual Plots: modC
p1 <- ggplot(aug_C, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug_C |> slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_C |> slice_max(abs(.resid), n = 3),
                  aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual")

p2 <- ggplot(aug_C, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", x = "Standard Normal Quantiles")

p3 <- ggplot(aug_C, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", y = "|Std. Residual|^(1/2)")

p4 <- ggplot(aug_C, aes(x = .hat, y = .std.resid)) +
  geom_point() +
  geom_point(data = aug_C |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_C |> filter(.cooksd >= 0.5),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Assessing Residuals for modC",
    caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
ggplot2 Residual Plots: modC
ggplot2 Residual Plots: modD
p1 <- ggplot(aug_D, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug_D |> slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_D |> slice_max(abs(.resid), n = 3),
                  aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual")

p2 <- ggplot(aug_D, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", x = "Standard Normal Quantiles")

p3 <- ggplot(aug_D, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", y = "|Std. Residual|^(1/2)")

p4 <- ggplot(aug_D, aes(x = .hat, y = .std.resid)) +
  geom_point() +
  geom_point(data = aug_D |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_D |> filter(.cooksd >= 0.5),
                  aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(aes(xintercept = 3 * mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Assessing Residuals for modD",
    caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
ggplot2 Residual Plots: modD
Assessing the Candidates (in the test sample)
modA prediction errors in test sample
Since squaring undoes the square root transformation, we will square the fitted values provided by augment() and then calculate prediction errors on the original scale of triceps, as follows…
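A minimal sketch of this back-transformation (the test sample name nnyfs_test is an assumption; the deck defines only nnyfs_train explicitly):

# Back-transform modA's test-sample predictions to the triceps scale
test_A <- augment(modA, newdata = nnyfs_test) |>
  mutate(fit_triceps = .fitted^2,              # square to undo sqrt()
         res_triceps = triceps - fit_triceps)  # error on original scale

The same two mutate() steps apply to modB, modC, and modD.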
We fit modA, modB, modC and modD to predict \(\sqrt{triceps}\) after deleting 21 subjects with missing triceps and imputing (singly) one missing waist, yielding 1497 subjects.
Training Sample: differences between models not large
modC has best adjusted \(R^2\), \(\hat{\sigma}\) and AIC
modB has best BIC, modD (of course) has best \(R^2\)
Residual plots: Linearity and Normality look OK.
Maybe an increasing trend in scale-location plot.
No signs of meaningful collinearity in training sample.
Overall Conclusions, 2
Test Sample: differences also not especially large
modB has smallest MAPE, RMSPE, largest validated \(R^2\)
modA has smallest maximum error
modC has smallest median error
I’d probably go with modB (emphasize predictive performance) but there’s a case for some other choices.
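As a sketch of how these test-sample summaries could be computed (assuming the models exist and the test sample is named nnyfs_test; the helper below is hypothetical, not from the original deck):

# Hypothetical helper: summarize prediction errors on the triceps scale
test_errors <- function(mod, data) {
  fit <- predict(mod, newdata = data)^2      # back-transform sqrt fit
  res <- data$triceps - fit
  tibble(MAPE = mean(abs(res)),              # mean absolute prediction error
         RMSPE = sqrt(mean(res^2)),          # root mean squared prediction error
         max_error = max(abs(res)),
         median_error = median(abs(res)),
         valid_Rsq = cor(data$triceps, fit)^2)  # validated R-squared
}

bind_rows(test_errors(modA, nnyfs_test) |> mutate(model = "modA"),
          test_errors(modB, nnyfs_test) |> mutate(model = "modB"),
          test_errors(modC, nnyfs_test) |> mutate(model = "modC"),
          test_errors(modD, nnyfs_test) |> mutate(model = "modD")) |>
  relocate(model)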
Training sample: modB (waist + age) shows \(R^2\) = 0.682